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Abstract 

In this paper we propose methodology for inference of binary-valued adjacency matrices from 
various measures of the strength of association between pairs of network nodes, or more generally 
pairs of variables. This strength of association can be quantified by sample covariance and cor¬ 
relation matrices, and more generally by test-statistics and hypothesis test p-values from arbitrary 
distributions. Community detection methods such as block modelling typically require binary-valued 
adjacency matrices as a starting point. Hence, a main motivation for the methodology we propose 
is to obtain binary-valued adjacency matrices from such pairwise measures of strength of association 
between variables. The proposed methodology is applicable to large high-dimensional data-sets and 
is based on computationally efficient algorithms. We illustrate its utility in a range of contexts and 
data-sets. 


1 Introduction 


Networks and other non-Euclidean relational datasets have become important applications in modern 
statistics. Key considerations include balancing statistical hdelity with computational tractability. Much 
effort has gone into developing parametric models for networks which take account of such considera¬ 
tions, typically by specifying both node-specific effects such as degree, and grouped-node effects such 
as community structure [Holland et al.[ 1983[ Bickel and Chen 2009 Rohe et al. 2011 Qin and Rohe 


2013, Wilson et al. 20131. One of the most widely studied of these models is the stochastic blockmodel 


in which (under the assortative assumption) there is a greater probability of observing an edge (or in¬ 
teraction) between a pair of nodes (or entities) if they are in the same block, or community. Practical 
approaches to finding communities in social and biological networks have been studied for many years 
[Girvan and Newman 2002 , and real life examples of this problem include identifying groups of friends 
in social networks, and identifying functional subnetwork modules in biological networks. In the biologi¬ 
cal setting, considering groups of genes defined together as subgraphs can lead to increases in statistical 


power, aiding discovery of biological phenomena Jacob et al. 2012 Li and Li 2010 Peng et al. 2010 


There are important differences between community detection and clustering. A community within 
a network typically refers to a grouping of entities with a strong tendency for direct interaction within 
the group, such as a friendship group in a social network. On the other hand, a cluster typically refers 
to a group of variables which are highly correlated, but these variables do not necessarily represent 
entities which interact directly. However, practical application of community detection and clustering 
methodologies often yield similar results. The stochastic blockmodel is an efficient method to detect 
communities in networks, and more generally it can be used to cluster together variables with correlated 
observations. However, most of the important theoretical understanding of the stochastic blockmodel 
has been developed under the assumption of a binary-valued relationship between the network nodes 


Holland et al. 

1983 

Bickel and Chen 

2009 Rohe et al. 

2011 

Qin and Rohe 2013, [Wilson et al. 

2013 

Olhede and Wolfe 

2014 . This relationship corresponds to the presence and absence of network 


edges between these nodes, and is typically represented ones and zeros (respectively) in an adjacency 
matrix. If such theoretical understanding is to be relevant to the use of community detection / the 
stochastic blockmodel as a means of clustering, the data to be clustered must first be transformed into 
this binary-valued format. 

The methodology that we propose in this paper allows a binary-valued adjacency matrix to be esti¬ 
mated based on association matrices composed of sample covariances, or correlations, or test statistics 
from arbitrary known or unknown distributions. This binary-valued adjacency matrix is then an ideal 
summary of the relational data-set on which to carry out community detection. Hence, the main mo¬ 
tivation of this paper is to propose methodology to allow continuous-valued statistics which measure 
the strength of association between pairs of variables to be transformed into a binary-valued adjacency 


1 






































































matrix format, for use in community detection. In this format, ones and zeros can be considered to 
represent variables which are and are not correlated, respectively. 

If a binary-valued adjacency matrix is used to define pairs of variables which are correlated, and 
other pairs of variables which are not correlated, then the zero entries in this matrix define pairs of 
variables which are independent. This relates closely to the ‘probabilistic graphical model’ Koller and 


Friedman 2009 paradigm, in which a joint probability distribution over a large number of variables 


is made tractable by taking advantage of independencies between pairs of variables as specified by the 
graphical model. These ideas are also closely related to thresholding a covariance matrix to a sparse 
representation Bickel and Levina 2008 Rothman et al. 2009, Bien and Tibshirani 2011 , where again 


zeros in the sparse representation imply independent pairs of variables. Sparse multivariate methods such 
as the lasso Tibshirani, 1996 are also popular for obtaining sparse representations via linear modelling, 
and can be extended to networks data via the graphical lasso Friedman et al. 2008 . However the 


methodology proposed in this paper offers two main advantages over the lasso in this context. Firstly, 
the computational implementation is via a closed-form expression and therefore it is much quicker than 
the iterative procedures required by the lasso. Secondly, the mixture-modelling strategy we employ is 
precisely specified for the problem we consider here, unlike the lasso. 

This paper is organised as follows. In section we define notation and present the methodology and 
practical details for its usage and implementation. Then in section]^ we present examples to illustrate the 
performance of this methodology, including a simulation study and several real data-sets from different 
contexts. 


2 Proposed methodology 


We start this section by specifying the model which we will use to estimate the adjacency matrix A. 

Definition 1. For m € define the set of network nodes {1,...,to}, and for each node i define a 
corresponding variable Xi. Let Zij represent an observed measure of association/dependence between 
variables Xi and Xj, where: 

Zij ~ Af (^fXij , (T ) . 

Let A G {0, QYi adjacency matrix, the elements of which satisfy: 


Aij — < 


if there is no edge between nodes i and j, implying 
that the variables Xi and Xj are independent, 


if there is an edge between nodes i and j, implying 
that the variables Xi and xj are not independent, 


and let w = p{Aij = 1). 
mixture distribution: 


Then, the observed measures of association Zij 
Zij ~ (1 — w) • W (0, cr^) + w ■ Af [pij,a^) . 


may be modelled using the 


( 1 ) 


In section 2.1 we describe how to calculate the observed measures of association/dependence Zij from 
sample covariance/correlation matrices. Then, in section 2.2 we describe the equivalent calculations 
based on test statistics from arbitrary or unknown distributions. Next, in section |2.3| we describe how 
the model of definitioncan be fitted, and how the adjacency matrix A can be estimated from the fitted 


model. Then in section 


2.4 


we discuss community detection based on A. 


2.1 Applying the model to a covariance/correlation matrix 

We can estimate an adjacency matrix from a sample covariance or correlation matrix by fitting the model 
of definition [2 by starting with the following procedure. Equation [^defines the sample covariance matrix 
S for the m variables represented by the vector x, xi, ...,Xm, for samples x(A:), k = 1, ...,n: 

S = —^ (x(A:) — x) (x(fc) — x)"^, where 5c=—'^x{k). (2) 

^ k=l ^ k=l 
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By dividing each row and each column of S by the square roots of the corresponding elements of the 
leading diagonal, we obtain the sample correlation matrix r: 


(diag(S)) S (diag(i:)) 


r = 


- 1/2 


The element of f, i.e. fij, is the Pearson correlation coefficient between variables Xi and Xj. If 

Xi and Xj are jointly normally distributed, and the {xi{k),Xj{k)}, k = 1, ...,n samples are independent. 


the Fisher transform Fisher 


1915 


converts Vij to the approximately normally distributed variable z. 


n- 


1 


Zj-i = - In , 

^ 2 V 1 - r. 


1 + f. 


( 3 ) 


where 


AA( xln 


1 + r^. 


1 — ra I ’ z/ — 3 


where Xij is the true correlation coefficient between variables Xi and Xj , and v is the degrees of freedom. 
Hence, we can model the Fisher-transformed sample correlation coefficients Zij with the mixture model 
of equation also with: 



1 + fjj 

1 - nj 


and tr^ 


1 

— 3 ’ 


( 4 ) 


2.2 Applying the model to test statistics from arbitrary distributions 

We can also estimate an adjacency matrix by fitting the model of definition when the association 
between variables Xi and Xj is assessed by a test-statistic from an arbitrary distribution expressed as 
a hypothesis-test p-value. Such a p-values may result from test-statistics from any known distribution, 
or may even be derived from an unknown distribution, for example by Monte-Carlo simulation. We 
can represent these p-values in the matrix P, where pij is the estimated probability of observing the 
association test-statistic for the pair of variables Xi and Xj under the null hypothesis Hq that there 
is no association between Xi and Xj (i.e. they are independent). Assuming these p-values arose from 
upper-tailed tests, we can apply the inverse-normal transformation as follows: 

Zy = (1-Py ), (5) 

with an equivalent expression available for lower-tailed tests. Applying this transformation is equivalent 
to applying quantile normalisation, mapping the null distribution oipij onto the standard normal A (0,1) 
distribution. Hence, after applying this transformation we can again fit the mixture model of definition 
and use this model fit to infer the estimated adjacency matrix A. 


2.3 Model fitting and adjacency matrix inference 


We propose fitting the model of defini tion [T] with an empirical Bayes procedure used previously for 
thresholding Johnstone and Silverman 20041. This method is based on a mixture prior over pij, with a 
Laplace density for the non-zero mean component. 

Definition 2. With pij and w given hy definition^ let 7 (•) represent the Laplace distribution probability 
density function with spread parameter a: 

liPi]) = ^exp(-alpyl). 

Then, the mixture prior over pij is defined as: 

f prior ihij') ~ (t ^ i.Tij') T (.Tij) • 

Typically the Laplace spread parameter is taken as a = 0.5. If the mixture components have Gaussian 
likelihoods fx (j/rij,cr^) as in definitionit follows from definitionthat the posterior density over 
the observed measures of association z^ is: 


/"posterior ) — 


(1 - Wi)5{p^j) fjg (zijj0,a^) + [pij) fjg {zij\p,ij, 
/marginal {,Zij ) 
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where the marginal density is: 


/marginal i^ij) — (1 '^i)fN ^ ) “t i^ij) : 


( 6 ) 


where g {g,ij ) is the convolution of the Laplace density with the standard normal density. Comparing 
the expression for /marginal {zij) in equation [6| with equationwe see that the normally-distributed non¬ 
zero mean mixture component in equation f^s replaced with the convolution of this Laplace and normal 
densities in equatio n If a Gaussian prior were used here instead of the Laplace prior, then the marginal 
density in equation [djwould be exactly the same as equation]^ However, as noted previously Johnstone 


and Silverman, 2004 , this empirical Bayes procedure requires a prior with tails that are exponential or 


heavier. Hence we use, as previously, the Laplace rather than a Gaussian prior. We note that this is a 
slight model mis-specification. 

This procedure results in a separate model being fitted to each pair of variables (ai^, Xj), based on the 
corresponding observed statistic Zij. This methodology was originally developed to be applied to vector 
data (in the form of wavelet coefficients) [Johnstone and Silverman] 2004 . Because the dependency 


structure of matrix data (such as covariance or correlation matrices) may be different to that of vector 
data, we apply the model fitting to each row of the association matrix, i.e. a vector, separately. As 
the association matrices under consideration are symmetric, this is equivalent to applying the method 
to both rows and columns of the matrix. We then take a conservative estimate, only inferring an edge 
in the network when there is agreement between the result of model fitting with respect to both rows 
and columns of the association matrix. Applying the methodology in this way results in a common 
weight Wi being used for all models corresponding to each Xi. This estimate of Wi is found as the value 
which maximises the marginal likelihood (equation of the observed statistics Zij over all the pairwise 
comparisons of Xi with Xj, j ^ i. This allows the model for each pairwise comparison (xi,Xj) to ‘borrow 
strength’ from all the other comparisons {xi,Xj>), j' ^ i, j' yf j: 


w. 


: arg max 

W 


log {(1 - w)(j) (zij) + wg {zij)} . 


(7) 


For a particular Xi, if the Zij are mostly close to zero then Wi will be set low, which means that fewer 
edges {Aij = 1) will be detected: this corresponds to i being a low-degree node. If for a different Xi, 
the Zij are generally further from zero, then Wi will be set high, which corresponds to more edges being 
detected: this corresponds to i being a high-degree node. Hence, setting Wi separately for each variable 
Xi allows adaptation to a heterogenous degree distribution in A. 

we use the posterior 


As in the original use of this methodology Johnstone and Silverman, 2004 


median to calculate /ty . Based on this, we can estimate the corresponding adjacency matrix entry Aij 
as: 


If I I ^ 0 5 


Aij —1 


Aij =0 otherwise. 


( 8 ) 


We make the conservative estimate of Aij discussed above as follows: 


Aij —1 if iMiyl ^ 0 and ^ 1^; 


(9) 


Aij =0 otherwise. 


We note that requiring agreement between l/tyj > 0 and \jj,ji\ > 0 is likely to result in decreased 


sensitivity: this point is discussed further in section |3T] the context of the simulation study. The spread 
parameter a in the Laplace prior is set as standard as a = 0.5 [Johnstone and Silverman 2004 . However, 
for additional model flexibility where needed, a can also be estimated by marginal maximum likelihood, 
in which case we estimate separately for each variable Xi, simultaneously with Wi. 


2.4 Community detection 


Having inferred A, community detection Girvan and Newman 

2002 may then proceed by fitting the 

degree-corrected stochastic blockmodel Holland et al. 

1983 

|Bickel and Chen 

2009 

Rohe et al. 

2011 


Qin and Rohe 2013 directly to A. However, to fit the degree-corrected stochastic blockmodel the number 
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of communities in the model, T, must first be specified; this number can be estimated by the ‘network 
histogram’ method Olhede and Wolfe, 2014 . Using this estimate of the number of communities 


we 


infer the set of communities C based on A, such that a community Ct G C, t G {1, is a group of 

variables Xi, i G Ct- Such a community Ct would correspond to an unexpectedly large number of non-zero 
entries > 0 of the sample covariance matrix S for pairs of variables Xi and Xj where i G Ct and 
j G Ct- Alternatively, the community Ct would correspond to an unexpectedly large number of significant 
p-values pij in the matrix P for pairs of variables Xi and Xj again with i G Ct and j G Ct- 


3 Examples 


In this section, we present the results of applying the methodology proposed in section to simulated 
data, and to publicly available data-sets relevant to genomics and consumer-product reviews. For each 
data-set, we carry out network inference as described in sections |2.1| - resulting in a binary-valued 
adjacency matrix. To each such adjacency matrix, we fit the degree-corrected stochastic blockmodel, by 
regularised spectral clustering Holland et al., 1983 Bickel and Chen 2009 Rohe et al.} 2011 Qin and 


Rohe 2013 


Spectral clustering is in general computationally intensive, as it requires the singular value 

[23 


decomposition (SVD) of a large matrix. However, the network inference described in sections 2.1 


provides us with a sparse binary-valued adjacency matrix, and efficient computational methods exist to 


find the top few components in the SVD of large sparse matrices Sprensen 1992, Lehoucq and Sprensen 
1996 . Hence, as we only require as many SVD components as the number of communities or clusters 


we are trying to find (which tends to be two or more orders of magnitude smaller than the dimension 
of the adjacency matrix, m), these efficient computational methods can be used here. Relevant software 
implementations of these methods are included in Matlah and R, meaning that this methodology is 
practical for large data-sets, and is quick to implement for many end-users. 


3.1 Simulation study 

We first carried out a simulation study, to assess the effectiveness of our network inference methodology in 
the context of generated networks with known community structure. A generative model for exchangeable 
random networks with heterogenous degrees is the logistic-linear model Perry and Wolfe 2012 . We use 
a version of that model here with community structure added, defined as: 

Logit (p^j) =ai+ aj + 

where pij defines the probability of an edge being observed between nodes i and j. We choose to use 
this model, because the parameters can take any real values, whilst the the edge probabilities pij are 
guaranteed to lie between 0 and 1. This model only deviates from the equivalent log model when the 
parameter values become very large - it is this effect that prevents pij from reaching (and exceeding) 
1. The node-specific parameters Oi, i G 1, ...,m are elements of the parameter vector a which defines 
a power-law degree-distribution for the nodes. Each ai is generated as the logarithm of a sample taken 
from a bounded Pareto distribution 


Olhede and Wolfe, 2012 


to be random, our generated networks are exchangeable Kallenberg, 2005 


We note that because our ai are chosen 
whereas if the elements of 


a were defined deterministically, these networks would instead be generated under the inhomogenous 

The community parameter 0^ is allowed to take two values: 


random graph model Bollobas et al. 


2007 


Qij = Gin if i and j are in the same community, and Oij — 0out otherwise. We choose to constrain 0y in 
this way because it is a simple means of adding community structure, and it is equivalent to a modelling 
constraint which improves parameter identifiability in some formulations of the stochastic blockmodel 
After generating the Pij, the network is generated by sampling each according to 


Newman 2013 


the law of: 


Bernouilli {pij ). 


The communities themselves are planted in the network as randomly chosen groups of 150 nodes. We set 
the number of communities k = 20, and hence the generated networks each comprise m = 3000 nodes. 

Having generated a network with known ground-truth community structure in this way, we use it 
to randomly generate a sample correlation matrix f, from which we attempt to reproduce the known 
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community structure. To do this, we first generate a random sample covariance matrix S^- for each pair 
of nodes i and j, according to: 

Sij ~ Wishart (S, v) 


where 


S = 


1 r, 


gen 

1 


if Aij = 1, where rgen is the model generative correlation coefficient, and 


S = 


1 0 
0 1 


if Aij = 0, and v is the degrees of freedom. We then calculate the estimate of the sample Pearson correla¬ 


tion coefficient for nodes i and j as ^ X ^ ^ ^ X 

With all elements of f generated in this way, with = fji and fa = 0 for i,j € {1, ...,m}, we proceed 

with network inference and community detection according to the methodology set out in section]^ 

We test the proposed methodology on networks generated with values of 6*in S {50,30, 20,10}, which 
correspond to within-community edge density pin G {0.81,0.34,0.15,0.039}. For all networks, we set 
Oont = 1) corresponding to between-community edge density pout = 0.0013. We generate sample covari¬ 
ance matrices with rgen G (Oil]; and degrees of freedom v G {50,100,200}. For each combination of 
parameters, we carry out 50 repetitions of network generation followed by network inference and com¬ 
munity detection. These repetitions enable assessment of the variability of the accuracy of the network 
inference. To compare detected communities in the inferred network with the ground-truth planted 
communities, we use the normalised mutual information (NMI) [Danon et al.[ |2005| . The NMI assesses 
the numbers of nodes which appear together in the detected communities, compared with whether they 
appeared together in the planted communities (adjusted for group sizes). The NMI takes the value 1 if 
the communities are perfectly reproduced in the community detection, and 0 if they are not reproduced 
at all, and somewhere in between if they are partially reproduced. 

The results of the simulation study are shown in Figure The accuracy of reproduction of the 
ground-truth community structure is high (as evidenced by NMI values close to 1), if the generative 
correlation coefficient rgen is sufficiently large. There is rapid deterioration of performance below the 
optimal range of Cgen, and when rgen is sufficiently low, no edges are detected. In this regime, the non-zero 
mean component of the generative mixture model is centred sufficiently close to zero that the Zij from this 
component become categorised together with those from the zero-mean mixture component. The result 
is that the model fitting effectively assigns all Zij to the zero-mean component. However, as long as the 
generative correlation coefficient rgen is sufficiently large, the method performs well even with fairly sparse 
within-community edge density in the ground-truth planted communities. Typically, the method fails 
when rgen falls below roughly 0.45, 0.35 and 0.25 for v = 50, v = 100 and v = 200, respectively. In the 
regime where the method is close to failing, there is an apparent increase in performance before complete 
failure, which manifests as the spikes in NMI values seen in in Figure in the range 0.3 < pgen < 0.4. 
This phenomenon occurs because in this regime, there is a transition from mainly larger communities 
being detected to many more smaller communities being detected, as evidenced by a decrease in the mode 
of the distribution of detected community sizes (Supplementary Figure SI). Community size is initially 
maintained in this regime as pgen is decreased below 0.4, and the corresponding decline in performance 
occurs because these larger communities only partially overlap with the ground-truth communities. As 
Pgen is decreased further and gets close to the point where the methodology will fail completely, fewer 
edges are detected overall leading to the larger communities breaking up into many small communities. 
These small communities are mostly subsets of the the ground-truth communities, and this is reflected in 
the higher NMI values. As pgen is decreased beyond this regime, no edges are detected and the method 
fails completely. We also note that for large values of rgen, the performance of the methodology is slightly 
worse for the largest values of pin- The planted ground-truth communities each comprise 150 nodes, and 
this decrease in performance occurs because in this regime several of these communities coalesce in the 
inferred network to form a much larger connected component (Supplementary Figure S2). This is likely 
to be due to the higher false-positive rate in this regime (Supplementary Figure S4) leading to spurious 
connections between communities. 
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Figure 1: Simulation study: performance of proposed methodology. 

Normalised mutual information (NMI) compares deteeted community structure with ground-truth planted com¬ 
munities. Each line corresponds to a different within-community edge-density; these are set as pin € 
{0.81,0.34,0.15,0.039} by setting 6in € {50,30,20,10}. The degrees of freedom, v, are set as v G {200,100,50}. 
For each network, the number of nodes m = 3000, the ground-truth number of communities is k — 20, and the 
between-community edge density is set as pout = 0.0013 by setting Bout = 1. Dashed lines indicated quartiles. 


The thresholding methodology which underlies the proposed methodology of section|2.3|was originally 


developed in the context of thresholding data vectors Johnstone and Silverman, 2004 . Applying this 


methodology to relational data matrices such as covariance and correlation matrices is complicated by 
the presence of additional dependency structure, and to mitigate spurious detection, the conservative 
adjacency matrix estimate of equation is used. To check the performance of the methodology in this 
context of adjacency matrix thresholding against the intended vector thresholding application, we carried 
out comparative true positive rate (sensitivity) and false positive rate (1-specificity) analyses. For these 
analyses the same simulated data is considered as is presented in Figure and the results appear in the 
supplement in Figures S3 and S4. True and false positive rates are calculated for the adjacency matrix 
inference presented in sections |2.1| - 12.3| and these results are labelled ‘matrix’ in Figures S3 and S4. 
The equivalent results based on equation are also recorded for each row of the thresholded adjacency 
matrix before applying the conservative estimate of equation and the means of these over each row 
of the adjacency matrix are also shown in Figures S3 and S4 and labelled ‘vector’. The true positive 
rate is only slightly lower for adjacency matrix inference than for vector thresholding, except when pi„ 
is lowest. The false positive rate is close to zero in all cases, although it is apparently sufficiently great 
for the largest values of O-m and pm to cause spurious coalescence of some communities, as discussed. 
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3.2 Comparison with popular clustering methods 


The clustering problem is fundamentally different to that of community detection, although there are 
nevertheless many similarities. The basic task of clustering is to group together entities (usually variables 
or samples) based on their similarity or distance from one another in observation space, which can 
assessed by, for example, Pearson correlation. When the entities being grouped are nodes in a network, 
the problems of clustering and community detection are very similar. In this study, we infer binary¬ 
valued networks from continuous data before carrying out community detection. However, a number of 
popular methods provide alternative means of clustering entities into groups (which may be considered 
equivalent to communities) based on continuous data. 



v = 200 



o I I I I 

0.0 0.2 0.4 0.6 0.8 1.0 

I'gen 


Pin = 0.8 
Pin = 0.34 
Pin = 0.15 
Pin = 0.039 


Figure 2: Simulation study: spectral clustering without network inference. 

Normalised mutual information (NMI) compares detected community structure with ground-truth planted com¬ 
munities. Each line corresponds to a different within-community edge-density; these are set as pin G 
{0.81,0.34,0.15,0.039} by setting 9in G {50,30,20,10}. The degrees of freedom, v, are set as u £ {200,100,50}. 
For each network, the number of nodes m = 3000, the ground-truth number of communities is k = 20, and the 
between-community edge density is set as pout = 0.0013 by setting 9out = 1. Dashed lines indicated quartiles. 


A method of clustering which is very popular across the biological and social sciences is hierarchical 
clustering. In that method, variables or samples are grouped together according to their distance from 
one another. A popular measure of distance between a pair i and j of such variables or samples is 
1 — \ fij\, where \ fij \ is the absolute value of the Pearson correlation coefficient between i and j estimated 
from the available observations. Hence, this method can be easily applied to data of the type presented 
here (without carrying out the network inference presented in section 2.3). We tested this method on 
the simulated data presented in section 3.1 by applying hierarchical clustering to the generated sample 
correlation matrix r before comparing the detected clusters with the planted communities. However, we 
found that in every case, the result of this comparison was an NMI value close to 0. Therefore, we may 
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conclude that hierarchical clustering performs significantly worse than the methods presented here on 
problems of this type. 

One of the most popular clustering methods is iiT-means. In that method samples (which may be 
thought of as equivalent to network nodes) are grouped into K clusters based on their location in N- 
dimensional space. On its own, this method is fundamentally ill-suited to network data because of the 
high dimensionality of the problem. However, if-means clustering is often used in spectral clustering after 
dimension reduction by SVD: we use that method of spectral clustering in this paper to fit the stochastic 
blockmodel. Spectral clustering can also be used to cluster continuous data, and so for comparison we 
have applied regular spectral clustering (without carrying out the network inference described in sections 
2.1 - 2.3) to the simulated data presented in section 3.1 To do this, we applied spectral clustering as 
described at the start of section [^directly to |f|, the absolute of the generated sample correlation matrix 
(i.e. to continuous data). The absolute values are used to ensure that the data is non-negative, as required 
for spectral clustering [Von Luxburg[ [200^ . The results appear in Figure Spectral clustering applied 
directly to f is generally less accurate (according to the NMI) than if the network inference/thresholding 
of sections |2.1|-|2.3| is first applied (Figure]^. One exception when spectral clustering applied directly 


to f is more accurate occurs when rgen is lowest, as in that regime the problem of total failure of the 


network inference/thresholding (as discussed in section 3.1) is avoided. Another such exception occurs 
when Pin is highest and rgen is large. The reason is that in this regime, the phenomenon of the ground- 
truth clusters/communities coalescing due to false positives caused by the network inference/thresholding 


(also as discussed in section 3.1) is avoided. However in general, for problems of the type presented here, 
applying the network inference/thresholding of sections 2.1 - |2.3| prior to carrying out spectral clustering 
produces more accurate results. Furthermore, as this network inference/thresholding generally results in 
a sparse adjacency matrix, it allows use of efficient computational methods to find the top components 
in the SVD which are required for spectral clustering. 


3.3 Genomics example 

We now give an illustrative example of a practical application of these methods to a standard problem 
in genomics. Community detection can be used to infer groups of genes which comprise functional 
subnetwork modules, or groups of co-regulated genes. Examples of such groups are found in gene 

Defining x(/c) to be the gene 


regulatory networks and protein signalling networks Shen-Orr et al. 2002 


expression measurements for sample k for the genes xi,X 2 , we calculate the covariance matrix 

according to equationand carry out network inference as described in sections |2T] - |2.3| We note that 
the network edges detected in this way may be transitive edges, i.e. they do not necessarily represent 
physical interactions between genes and gene products. To determine this would require additional 
functional data, such as those relating to DNA binding by gene products (e.g., transcription factors) 
Jojic et al. 2013 . However, in general the groups of genes detected in this way can be expected to form 


biologically meaningful subnetwork modules, generating biological hypotheses which may justify further 
investigation by experimental scientists. 

We carried out this process of network inference and community detection in gene expression data 
from 8 different types of cancer: brain, breast, colon, kidney, lung, ovarian, rectal and uterine (data 
source: The Cancer Genome Atlas [Hampton 2006j ). Each data set comprises gene expression measure¬ 
ments for 17505 genes (i.e., m = 17505). Figure [^ shows the inferred adjacency matrix after community 
detection for the lung cancer data-set. The number of communities is estimated as 105 by the network 
histogram method Qlhede and Wolf^ 2014j for this data-set, and the edge density is p = 0.062 (which 
is typical of all 8 gene expression datasets). 

We also tested the domain-relevance of the communities detected in the inferred networks. We tested 
the overlap of the genes of each detected community separately with each of 10295 known gene-groups 
(data source: http://www.broadinstitute.org/gsea/msigdb/). This is known as ‘gene set enrichment anal¬ 
ysis’ (GSEA) jSubramanian et 2005] . Table shows the percentage of the communities detected 
in each cancer data-set which overlapped significantly with at least one of these known gene-groups. 
For this purpose, significance is assessed by Fisher’s exact test, with the significance level set by FDR 
(false discovery rate) adjusted p < 0.05. As a benchmark, we also sampled random groups of genes from 
the 17505 genes represented in the cancer data-sets, and tested them for overlap with the same 10295 
known gene-groups. The number of genes in each random sample was itself randomly sampled from the 
distribution of the sizes of the communities detected in the cancer data-sets. We took 1000 randomly 
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Figure 3: Detected communities in a gene expression data set, relating to lung cancer. 

Entries in the adjacency matrix equal to 1 (representing a network edge) are coloured blue, and detected commu¬ 
nities are outlined in black. 


Breast 

Colon 

Brain 

Kidney 

Lung 

Ovarian 

Renal 

Uterine 

97% 

86% 

87% 

76% 

89% 

96% 

76% 

66% 


Table 1: Domain-relevance of detected communities in the genomics example. 

The table shows the percentage of the communities detected in each cancer data-set which overlap significantly 
(Fisher’s exact test, FDR-adjusted p < 0.05j with at least one known gene group. 

sampled groups of genes like this, of which 2% overlapped significantly with at least one of the known 
gene-groups. These results show a high level of domain-relevance of the detected communities, in all 8 
genomics data-sets analysed here. 

3.4 Consumer product review example 

We now give a second, contrasting illustrative example of a practical application of these methods to 
real data, based on a consumer-product review dataset. We downloaded movie review data from the 
Movie Lens database, which details 1 000 209 reviews of 3952 different movies, by 6040 unique users 
who each provided at least 20 different reviews (data source: http://grouplens.org/datasets/movielens/). 
Covariate information is also available, classifying each user into one of 7 age groups and 20 professions; 
this can be used to verify the detected communities/clusters. 

For each pair of users (i,j), we tested the overlap of the movies reviewed by user i with the movies 
reviewed by user j with Fisher’s exact test. This provided an estimated p-value for each pair of users 
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Pij , under the null hypothesis that there is no significant overlap between the movies reviewed by users i 
and j. These are a one-tailed test p-values corresponding to an alternative hypothesis that there is more 
overlap between movies reviewed by users i and j than would be expected by chance. Then, we applied 



granularity of this estimate is much greater than that of the covariate information we have available for 
verification of detected clusters. The network histogram method estimates the optimal granularity for 
the stochastic blockmodel, however we can also select a smaller number of communities with which to fit 
the stochastic blockmodel, whilst noting that this will not result in the optimal blockmodel as assessed 
by the mean squared integrated error (MISE) [Olhede and Wolfe 2014 . We selected 15 communities for 
the blockmodel, which is of the same order as the number of covariate classes, but chosen to be less than 
the total number of classes to take account of the fact that many of these classes are overlapping. The 
edge density p for the inferred adjacency matrix A is calculated as p = 0.16, which is relatively high. 

Figure shows the inferred adjacency matrix after community detection. The detected communi¬ 
ties are tested for overlap with the known covariate groups; those which overlap significantly (Fisher’s 
exact test, FDR-corrected p < 0.05) are specified along the margin. Almost all of the detected com¬ 
munities/clusters overlap with at least one covariate group, and several communities/clusters overlap 
with multiple covariate groups. Where the overlap is with multiple covariate groups, there is generally 
an obvious link between these groups, such as similar age groups, or professions which suggest similar 
demographic groups. These findings show that this methodology is very effective in the context of this 
example, in which we obtain A from an arbitrary non-Gaussian distribution, based on corresponding 
p-values of association between pairs of variables {xi,Xj). 


4 Conclusion 


In this paper, we have proposed methodology combining estimation of binary-valued adjacency matrices 
with community detection via the stochastic blockmodel, based on sample covariance and correlation 
matrices or more general test statistics quantifying association between pairs of variables. We have 
presented the theoretical basis for this proposed methodology, and provided practical details for its im¬ 
plementation. We have shown the accuracy of this methodology in the context of a simulation study, 
and have shown its effectiveness in several contexts based on multiple real data-sets, with a range of 
sparsities. We have also shown that this methodology performs better than popular clustering methods 
for discovering latent groupings in data of the type presented here. An important point to note, is that 
some network edges inferred from the correlation structure of data as in the methodology proposed here 
may be what are often referred to as ‘transitive edges’. I.e., an inferred edge may not correspond to a di¬ 
rect physical real-life interaction, instead deriving from some indirect interaction which may alternatively 
be mediated via a less direct route through the network, possibly also involving unobserved variables. 
Interesting extensions to this methodology include consideration of overlapping blocks in the stochastic 
blockmodel Latouche et al. 2011 , and development of an online version of the methodology as a com¬ 


putationally efficient approach to large and growing data-sets [Zanghi et al. 2010 . This methodology 


would be expected to work equally well in many other networks contexts, and in more general scenarios 
where the aim is to cluster together correlated variables. This methodology can be implemented using 
readily available and computationally efficient algorithms, and performs well on large high-dimensional 
datasets. 
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Figure 4- Detected communities in the movie review data set. 

Entries in the adjacency matrix equal to 1 (representing a network edge) are coloured blue, and detected commu¬ 
nities are outlined in black. 
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